Background and Context:

Product categorization, also referred to as product classification, is a task within natural language processing (NLP) and one of the biggest challenges for e-commerce companies. With advances in AI, researchers have increasingly applied machine learning to product categorization problems.

Product categorization is the placement and organization of products into their respective categories. In that sense, it sounds simple: choose the correct department for a product. However, the process is complicated by the sheer volume of products on many e-commerce platforms, and many products can belong to multiple categories. There are many reasons why product categorization is important for e-commerce and marketing: accurate classification of your products can increase conversion rates, strengthen your search engine, and improve your site’s Google ranking.

A well-built product taxonomy allows customers to find what they are looking for quickly and easily. Making your site easy to navigate is one of the most important elements of your UX and leads to higher conversion rates. Correct categorization also lets your search engine fetch products faster and more accurately. Once a strong product taxonomy is in place, you can create relevant landing pages for your products; in turn, Google and other search engines can index your site and your products more easily. This helps your products rank higher on search engines, increasing the chance that customers find your site.

To help merchants choose the correct category, Amazon and other e-commerce companies have automated product categorization tools available. After simply inputting the title or a few words about the product, the system can automatically choose the correct category for you.

Dataset:

The dataset has the following features:

● Data Set Characteristics: Multivariate

● Number of Instances: 50424

● Number of classes: 4

Objective:

To implement the techniques learned as a part of the course.

The dataset has plenty of unnecessary columns, so I will keep only the relevant ones.
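As a sketch, column selection with pandas might look like this (the column names `Label` and `Text` and the extra columns are assumptions for illustration, not the dataset's actual schema):

```python
import pandas as pd

# Toy stand-in for the raw dataset; the real file has more columns.
raw = pd.DataFrame({
    "Label": ["Household", "Books"],
    "Text": ["wooden shelf for the living room", "a novel about the sea"],
    "unused_id": [101, 102],
    "unused_url": ["http://a.example", "http://b.example"],
})

# Keep only the two relevant columns.
df = raw[["Label", "Text"]].copy()
```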

After selecting the relevant columns, there are 50428 rows and 2 columns.

There are 22623 duplicated observations
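Counting and dropping fully duplicated rows can be sketched with pandas (toy data; the `Label` and `Text` column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": ["Books", "Books", "Household"],
    "Text": ["same description", "same description", "different description"],
})

# duplicated() flags every repeat after the first occurrence.
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
```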

After removing the duplicates, there are 27805 rows and 2 columns.

Strip HTML tags
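A minimal tag-stripping helper using only the standard library might look like this (a simple regex sketch; it assumes the text contains no literal `<` outside of tags):

```python
import re

def strip_html(text: str) -> str:
    # Replace anything that looks like a tag with a space,
    # then collapse the leftover whitespace.
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(no_tags.split())

cleaned = strip_html("<p>Stainless <b>steel</b> bottle</p>")
```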

Seven (7) labels appear instead of the expected four (4); the label column contains some noise, which will be cleaned so that exactly 4 labels remain.

The shape of the text data at this point is 27805 rows and 1 column.

Missing Value Analysis

There are no missing or null values in either the label or the text column.
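A quick missing-value check with pandas (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["Books", "Household"],
                   "Text": ["a novel about the sea", "a steel storage rack"]})

# Count nulls per column; both counts should be zero here.
missing = df.isnull().sum()
```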

The function below returns these outputs in a DataFrame.

There are 27805 observations and 27804 unique texts. The majority of the labels are Household.

We now have 4 unique labels as expected.

For most observations, the character count of the Text column is below a label-specific threshold:

● Household: under 2000 characters

● Books: under 5000 characters

● Clothing & Accessories: under 750 characters

● Electronics: under 2000 characters

The Household and Electronics labels have similar character distributions.

Similarly, for most observations, the word count of the Text column is below a label-specific threshold:

● Household: under 250 words

● Books: under 1000 words

● Clothing & Accessories: under 1500 words

● Electronics: under 500 words
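The per-label length summaries above can be computed along these lines (a pandas sketch with toy data; the notebook presumably plots histograms rather than printing tables):

```python
import pandas as pd

df = pd.DataFrame({
    "Label": ["Household", "Household", "Books"],
    "Text": ["steel rack for kitchen", "wooden shelf",
             "a long novel about the sea"],
})

# Character and word counts per observation.
df["char_count"] = df["Text"].str.len()
df["word_count"] = df["Text"].str.split().str.len()

# Summary statistics of word counts, grouped by label.
stats = df.groupby("Label")["word_count"].describe()
```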

The Household and Electronics labels have fewer words per observation.

So, the most current data, after removing duplicates and before removing noise, has 27802 rows and 2 columns.

Preprocessing Texts

Preprocessing: Remove URLs (https links)

Preprocessing: De-contraction of words
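One common approach to de-contraction is a rule-based replacement map; the map below is a small illustrative subset, not the notebook's actual list:

```python
import re

# More specific patterns must come first ("won't" before "n't").
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'s", " is"),
]

def decontract(text: str) -> str:
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return text

expanded = decontract("it won't fit and they're late")
```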

Preprocessing: Removal of numbers

Preprocessing: Tokenization

Preprocessing: Lowercasing

Preprocessing: Removal of punctuation

Preprocessing: Removal of stopwords

Preprocessing: Lemmatization

Preprocessing: Removal of non-ASCII characters

Preprocessing: Removal of apostrophes and other non-lowercase-alphabet characters left after the other steps

Preprocessing: Auto-correction of wrongly spelt words

Update the data before normalization

Text preprocessing: Normalization (aggregating the preprocessing functions into one)
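Aggregating the steps might look like the sketch below (standard library only; the stopword list is a tiny illustrative subset, and lemmatization and spell-correction are omitted here because they typically require NLTK or an autocorrect package):

```python
import re
import string

# Illustrative subset; the notebook presumably uses a full stopword list.
STOPWORDS = {"a", "an", "the", "is", "for", "and", "of", "in", "with"}

def normalize(text: str) -> list:
    """Aggregate the preprocessing steps into one function (sketch)."""
    text = re.sub(r"https?://\S+", " ", text)       # remove URLs
    text = text.lower()                             # lowercase
    text = re.sub(r"\d+", " ", text)                # remove numbers
    text = text.translate(                          # remove punctuation
        str.maketrans("", "", string.punctuation))
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII
    tokens = text.split()                           # tokenize
    return [t for t in tokens if t not in STOPWORDS]

tokens = normalize("Buy the 2-pack USB cable for https://example.com now!")
```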

The most common words in the Text column
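Word frequencies can be counted with `collections.Counter` (toy texts for illustration):

```python
from collections import Counter

texts = [
    "cotton shirt men",
    "cotton saree women",
    "usb cable black",
]

# Flatten all texts into a single word stream and count occurrences.
word_counts = Counter(word for text in texts for word in text.split())
top_word, top_count = word_counts.most_common(1)[0]
```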

Most common words per label

Create DataFrames for each label

Finding common words for Household-labelled texts

Finding common words for Books-labelled texts

Finding common words for Clothing & Accessories-labelled texts

Finding common words for Electronics-labelled texts

Create a word cloud in which the more common words appear larger.

The majority of the labels are 'Household' (37.99%), followed by 'Books' (22.50%), then 'Clothing & Accessories' (20.41%); the least common label is 'Electronics' (19.09%).

Building the model based on CountVectorizer and Classifiers

Re-label the classes in the dataset using numbers

Create class weights to balance the dataset, since some labels appear more often than others
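One way to derive such weights is the "balanced" heuristic that scikit-learn uses, n_samples / (n_classes * count_per_class); a standard-library sketch with toy label counts:

```python
from collections import Counter

# Toy imbalanced label list (counts are illustrative, not the dataset's).
labels = (["Household"] * 6 + ["Books"] * 3
          + ["Clothing & Accessories"] * 2 + ["Electronics"] * 1)

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Rarer classes get proportionally larger weights.
class_weight = {label: n_samples / (n_classes * count)
                for label, count in counts.items()}
```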

The Random Forest Classifier performs best, with a cross-validation score of about 92%, followed by the Logistic Regression model at about 91%, while the Decision Tree Classifier gave the lowest score, about 86%. Thus, we will proceed with the Random Forest Classifier to build the model.

Optimizing the parameter: Number of trees in the random forest model (n_estimators)

Misclassification error is at its minimum for n_estimators = 19, with an error of 8.7%.
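The sweep over `n_estimators` can be sketched as below (the tiny synthetic texts, the narrow range of tree counts, and the fixed `random_state` are illustration choices, not the notebook's actual settings):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the real product descriptions.
texts = ["cotton shirt", "usb cable", "novel book", "steel rack"] * 10
labels = ["Clothing", "Electronics", "Books", "Household"] * 10

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42)

# Misclassification error (1 - accuracy) for each candidate tree count.
errors = {}
for n in range(1, 6):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    errors[n] = 1 - model.score(X_test, y_test)

best_n = min(errors, key=errors.get)
```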

The CountVectorizer-based model predicted with an accuracy of 91.5%.

Word Cloud of top 20 important features from the CountVectorizer + Random Forest based model

Visualize more common words using count vectorizer

The most common words for the CountVectorizer are women, men, author, cotton, book, usb, wear, and steel.

Term Frequency (TF) - Inverse Document Frequency (IDF)

The data split was stratified to maintain a proportionate distribution of the labels in every split and to preserve the class weights.
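A stratified split over TF-IDF features might look like this (toy data; with five examples per class and `test_size=0.2`, each class contributes exactly one test example):

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["cotton shirt", "usb cable", "novel book", "steel rack"] * 5
labels = ["Clothing", "Electronics", "Books", "Household"] * 5

X = TfidfVectorizer().fit_transform(texts)

# stratify=labels keeps the label proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0)

test_counts = Counter(y_test)
```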

n_estimators = 15 is the optimal learner, with a misclassification error of 8.76% for the TF-IDF vectorizer.

The TF-IDF vectorizer predicted with an accuracy of 91.7%.

The most common words using TF-IDF are book, women, cotton, men, camera, black, fit, usb, author, home, stainless, wear, and write.

Predictions based on test set with visualizations on dataframe

The TF-IDF Vectorizer predicted the test data with 92% accuracy, precision, recall, and F1-score.

The Count Vectorizer predicted the test data with 91% accuracy, recall, and F1-score, and a 92% precision score.

Therefore, the TF-IDF Vectorizer predicted with slightly higher accuracy, recall, and F1-score. The closeness of the scores for the two vectorizers is due to the properly assigned class weights.

View the predicted test data in dataframes showing original and clean texts, and predictions

The TF-IDF Vectorizer predicted with an overall accuracy of 91.68%, while the Count Vectorizer predicted with an overall accuracy of 91.49%. Thus, the TF-IDF vectorizer gave a slightly more accurate prediction.

Predictions using further test sets created from original dataset with visualizations on dataframes

Conclusion:

Summary and Recommendations: